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Abstract. A deterministic BSP algorithm for constructing the suffix array of a given string 
is presented, based on a technique which we call accelerated sampling. It runs in optimal 
O(^) local computation and communication, and requires a near optimal O(loglogp) syn- 
chronisation steps. The algorithm provides an improvement over the synchronisation costs 
of existing algorithms, and reinforces the importance of the sampling technique. 
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1 Introduction 

Suffix arrays are a fundamental data structure in the string processing field. They have been 
researched extensively since their introduction by Manber and Myers |10ll3j . 

Definition 1. Given a string x — x[Q\ . . .x[n — 1] of length n > \, defined over an alphabet S, 
the suffix array problem aims to construct the suffix array SA^ — SAx[Q\ ■ ■ ■ SAx[n— 1] of x which 
holds the ordering of all the suffixes Si = x[i] . . . x[n — 1] of x in ascending order; i.e. SAx[j] — i 
iff Si is the j*'* suffix of x in ascending lexicographical order. 

1.1 Notation, Assumptions and Restrictions 

We assume zero-based indexing throughout the paper, and that the set of natural numbers includes 
zero. For any i,j £ N, we use the notation [i : j] to denote the set {a G N | i < a < j}, and [i : j) 
to denote {a G N | « < a < j}- 

The input of the algorithms to be presented in this paper is restricted to strings defined over 
the alphabet S — [0 : ri), where n is the size of the input string. This allows us to use counting 
sort throughout when sorting characters, in order to keep the running time linear in the size 
of the input. Counting sort is also used in conjunction with the radix sorting technique [3]. 

The set notation described above is extended to substrings by denoting the substrings of string 
xhy x[i : j), where x [i : j) = x[i] . . . x[j — 1]. Also, the end of any string is assumed to be marked 
by an end sentinel, typically denoted $, that precedes all the characters in the alphabet order. 
Therefore, to mark the end of the string and to ensure that any substring x [i : j) is well defined, 
for z G [0 : n) and j > i, we let x[k] = —1, for k > n. 

It should be noted that the algorithms to be presented in Sections [3j [5] can also be applied to 
any string X, of size n, over an indexed alphabet E' |13ll5j . which is defined as follows: 

— 17' is a totally ordered set. 

— an array A can be defined, such that, Vcr G S' , A[a] can be accessed in constant time. 

— < n. 
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Commonly used indexed alphabets include the ASCII alphabet and the DNA bases. It should 
also be noted that any string X, of size n, over a totally ordered alphabet can be encoded as a string 
over integers. This is achieved by sorting the characters of the string, removing any duplicates, 
and assigning a rank to each character. A new string X' of size n is then constructed, such that 
it is identical to X except that each character of X is replaced by its rank in the sorted list of 
characters. However, sorting the characters of X could require O(nlogn) time, depending on the 
nature of the alphabet over which X is defined. 

The example in Table [l] shows the suffix array for a string A", of size 12, over an indexed 
alphabet of a subset of the ASCII characters, written as string X' over S —[Q : 12). 



Table 1. Suffix array of a string X over an indexed alphabet, written as X' over X" = [0 : 12) 
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Let xi X2 denote the concatenation of strings xi and X2- Then, for any set of integers A, 
QieA is the concatenation of the strings indexed by the elements of A, in ascending index order. 
Throughout the paper we use |6| to denote the size of an array or string b. To omit [•] operations, 
we assume that any real numbers are rounded up to the nearest integer. 



1.2 Problem Overview 

The suffix array problem is, by definition, directly related to the sorting problem. In fact, if all 
the characters of the input string are distinct, then the suffix array is obtained by sorting the 
strings' characters and returning the indices of the characters in their sorted order. In general, 
if the characters of the string are not distinct, the naive solution is to radix sort all the suffixes, 
which takes O(n^) time if counting sort is used to sort the characters at each level of the radix sort. 
However, numerous algorithms exist that improve on this running time. The first such algorithm 
was presented by Manber and Myers [lOj and required O(nlogn) time. The running time was 
reduced to 0{n) through three separate algorithms presented by Karkkainen and Sanders [4], 
Kim et al. ilj, and Ko and Aluru f8^. A number of other algorithms exist with a higher theoretical 
worst case running time but faster running time in practice, as discussed in |13| . However, the 
study of these is beyond the scope of this work. 

The idea behind the algorithms having linear theoretical worst case running time is to use 
recursion as follows: 

1. Divide the indices of the input string a; into two nonempty disjoint sets. Form strings x' and 
y' from the characters indexed by the elements of each set. Recursively construct SA^i. 

2. Use SAx' to construct SAyi. 

3. Merge SA^' and SAyi to obtain SA^. 

The problem of constructing suffix arrays, while similar to the sorting problem, differs as 
follows. Given two sorted lists of integers, we are guaranteed that after merging them the order 
of the integers in the original lists is preserved. However, given two strings and their suffix arrays, 
the order of the suffixes is not necessarily preserved in the suffix array of the concatenated string. 
For example, the suffix arrays of strings aaa and aab are [2, 1, 0] and [0, 1, 2] respectively, but the 
suffix array of string aaaaab is [0, 1, 2, 3, 4, 5]. 

The aim of this paper is to investigate the suffix array construction problem in the Bulk Syn- 
chronous Parallel (BSP) model, on a p processor distributed memory system. As in the sequential 
setting, the naive general solution to the problem is to radix sort all the suffixes of the string. Shi 
and Shaeffer |14) provide a comparison based parallel sorting algorithm, using a technique known 



Table 2. Size of the difFerence cover obtained using the algorithm in 2 for various values of v 
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as regular sampling, which is then adapted by Chan and Dehne Jj for integer sorting. However, 
using such a technique to sort the suffixes of a given string of size n leads to a parallel algorithm 
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with O(^) local computation cost, 0{n) communication cost and requiring 0(1) synchronisation 
steps. Clearly, it is more efficient to simply use a linear time sequential algorithm. 

Karkkainen et al. [5j give a brief overview of a BSP suffix array construction algorithm having 
optimal O(^) local computation and communication costs and requiring O(log^p) synchronisation 
steps. They also present similar algorithms in the PRAM model. In this paper we further reduce 
the number of synchronisation steps required to a near optimal O (log log p), while keeping the 
same optimal local computation and communication costs. The algorithm is based on a technique 
that we call accelerated sampling. This technique was introduced (without a name) by Tiskin [T7] 
for the parallel selection problem. An accelerated sampling algorithm is a recursive algorithm that 
samples the data at each level of recursion, changing the sampling frequency at a carefully chosen 
rate as the algorithm progresses. 

1.3 Paper Structure 

The rest of the paper is structured as follows. The next section provides an overview of the 
concept of difference covers. The sequential suffix array construction algorithm is given in Section 
|3j An overview of the BSP model is provided in Section |4j and a description of the parallel suffix 
array construction algorithm in this model is presented in Section [5] The last section offers some 
concluding views and discusses possible future work. 

2 Difference Covers 

The suffix array construction algorithms to be presented in this paper make use of the concept of 
difference covers ,2 6 12 . Given a positive integer v, let Zy denote the set of integers [0 : v). A set 
D C Zy can be defined such that for any z G Z^,, there exist a,b Cz D such that z = a — b (mod 
v). Such a set D is known as a difference cover of Z„, or difference cover modulo v of Z^,. 

Colbourn and Ling [5] present a method for obtaining, for any v, a difference cover D of Zy in 
time 0{y/v), where \D\ = 6r + 4, r = -36+V48+96i. ^ Hence, \D\ < ^/^5w + 6. Note that, in general, 
for any v and any difference cover D of Z„, \D\ > ^+^^'"~^ ^ since we must have |L)|(|I?| — 1) + 1 > v. 
Therefore, the size of the difference cover obtained by using the algorithm in [2 is optimal up to 
a multiplicative constant. 

The algorithms to be presented in this paper require that \D\ < v, so we assume v > 3. The 
optimal difference covers of Z3, Z4 are of size 2, 3 respectively, and for v > 5 the method of [5] 
gives difference covers of sizes given in Table [2[ 

For technical reasons, discussed in Section|3j we also require that ^ D. This does not represent 
a restriction since, for any v and difference cover _D of Z^,, a fixed z € Z^, can always be chosen 
such that the set D' = {{d — z) mod v \ d E D} is also a difference cover of Z^, (see e.g. |12|). 

The following lemma is also required to ensure the correctness of the algorithms to be presented. 

Lemma 1. [5| If D is a difference cover ofZy, and i and j are integers, then there exists I £ [0 : v) 
such that [i + I) mod v and (j + I) mod v are both in D. 

For any difference cover D of Z^, and integer n > f , a difference cover sample is defined as 
C = {i e [0 : n) I i mod v G -D}. The index set C is a u-periodic sample of [0 : n), as defined in 
[5]. The fact that difference cover samples are periodic allows them to be used for efficient suffix 
sorting on a given string. 



3 Sequential Algorithm 



Karkkainen et al. [5] present a sequential recursive algorithm that constructs the suffix array of a 
given string x of size n, using a difference cover D of Zy, for any arbitrary choice of u € [3 : n], 
in time 0{vn). Clearly, by setting w = 3 the running time of the algorithm is 0{n), with a 
small multiplicative constant. As v approaches n the running time approaches O(n^), and when 
V = n the algorithm is simply a complex version of the naive suffix array construction algorithm. 
However, by initially letting v = 3 and increasing the value of u at a carefully chosen rate in every 
subsequent level of recursion, we can reduce the total number of recursion levels required for the 
algorithm to terminate, while still keeping the total running time linear in the size of the input 
string. This technique can be used to decrease the number of synchronisation steps required by the 
parallel sufRx array construction algorithm in the BSP model. This is discussed further in Section 
[Sj The detailed sequential algorithm proceeds as follows: 
Recursion base 

We sort X using counting sort, in time 0{n). If all the characters of x are distinct we return, 
for each character, in the sorted order, the index of the character in x, i.e. SA^. Otherwise, the 
following steps are performed: 

Algorithm 1. Sequential Suffix Array Construction 

Parameters: integer n; integer u e [3 : n] 

Input: string x = a;[0] . . . x[n — 1] over alphabet U — [0 : n) 

Output: suffix array SA^ = Svl^rp] . . . SAx[n ~ 1] 

Description: 

Stey - Sample construction and initialisation 

Construct the difference cover D of Zt, as discussed in Section [2j Then, for each fc S [0 : w), 
define the set Bk = {i € \^ : n) \ i mod v — k}. This partitions the set of indices of x into v sets 
of size about The difference cover sample C = UfcgD is then constructed. For i G C, we call 
the characters x[i] sample characters and the sufhxes sample suffixes. We also denote by Sk, 
fee [0 : ?;), the set of suffixes Si, i £ Bk- 

Furthermore, an array rank of size n + v is declared and initialised by rank[0] = . . . = 
rank[n + u — 1] = — 1. This array will be used to store the rank of the sample characters of x in 
the sufhx array returned by the recursive call made later in step 1. Only |C| elements of rank will 
be used, and in fact a smaller array can be used to hold these values. However, we use a larger 
array to avoid complex indexing schemes relating elements in rank to characters in x. 
Step 1 - Sort sample suffixes 

Let S be an alphabet of super-characters, which arc defined to be in 1-1 correspondence with 
the distinct substrings of x of length v: super-character x[i : i + v) corresponds to the substring 
x[i:i + v), for all i £ C. Therefore, S C (^S U {—1})". Recall from Section [I] that, due to the 
padding convention, any substring x [i : j) is well-defined, for is [0 : n) and j > i, and therefore 
any super-character x [i : j) is also well-defined. 

For each fc G 13, we now define a string of super-characters over S, where Xk — OieSfc x[i : i + v) 
and \Xk\ — ^. Then, we construct the string of super-characters X = Qi^^jj Xk, with |X| = \D\^. 
Note that for each fc, the suffixes of X^ correspond to the set of suffixes S).- The last super-character 
of Xk ends with one or more —1 elements, since is not allowed to be in the difference cover. 
Therefore, each suffix of X corresponds to a different sample suffix of x, followed by one or more 
— 1 characters followed by other characters that do not affect the lexicographic order of the suffixes 
of X. Note that, if was allowed in the difference cover and n was a multiple of v, then the last 
super-character of Xk would not end with —1. 

Recall from Section [T] that since the input to the algorithm is a string over natural numbers, 
the string of super-characters X can be encoded as string X' over S' = [0 : |X|) using radix 
sorting, in time 0(w|Ar|), where \X'\ = \X\ = \D\^. The order of the suffixes of X can then be 
found by constructing the suffix array of X by recursively calling the algorithm on the string 
A' over Z", with parameters |A'| and v' , where v' can be chosen arbitrarily from the range 



3 : min (^^jy — 1, \X'\j . Thus, v' becomes the value of v in the subsequent recursion leveL The 
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bound v' < yjjy ensures that the total work performed by the algorithm is still linear in n. 

Recall from Section [2] that we require |D| < v. This ensures that \X\ < n, so the algorithm is 
guaranteed to terminate, since each recursive call is always made on a shorter string. In fact, if 
the parameter v remains constant over all the levels of the recursion (say u = 3), then in each level 
the size of the string is reduced by a factor of (a factor of | for z; = 3, \D\ = 2). However, by 
carefully increasing the value v in every round, within the bounds specified above, we can reduce 
the number of recursion levels of the algorithm by accelerating the rate of string size reduction in 
each successive level of recursion, as discussed in detail in Section [5] 

When the recursive call returns with SAx'^ this holds the ordering of all the suffixes of X' , 
i.e. the ordering of the sample suffixes of x within the set of sample suffixes. Then, for i € C, the 
rank of Si in SAx' is recorded in rank[i]. Note that the order of the sample suffixes within each 
set 5fc, k Cz D, can be found from SAx'- 

The total cost of this step is dominated by the radix sorting procedure required to encode 
string X into X' over S' — [0 : \X\), which runs in time 0{\D\n). 

Note that we can now compare any pair of suffixes by the result of Lemma [l] However, this is 
not sufficient to sort the suffixes of x in linear time, since a different value of I would have to be 
found for each pair of suffixes and linear time sorting would not be possible. Instead, we perform 
the following steps. 

Step 2 - Find the order of the non-sample sujjixes within each set Sk, k £ 1,^ \ D 

For each k S 'Zy\D, consider any G [1 : v) such that (k + lk) mod v £ D. For every character 
x[i], i £ [0 :n)\C, define the tuple ti = + 1], . . . ,x[i + Ik — l],rank[i + Ik]), where k — i 

mod V. Note that rank[i + Ik] is defined for each i, since rank[a], for all a £ C, has been found in 
the previous step and rank[a] = —1 for all a > n. 

Then, for each set Bk, k £ Z„ \ Z?, construct the sequence of tuples {ti)i^Bk- Each of the 
— |D| constructed sequences has about ^ tuples, with each tuple having less than v elements. 
The order of the suffixes within Sk is then obtained by independently sorting every sequence of 
tuples {ti)i^BkJ using radix sorting. 

The total computation cost of this step is dominated by the cost of radix sorting all the 
sequences, i.e. O {{v — \D\) n) = 0{vn). 
Step 3 - Sort all suffixes by first v characters 

Note that in the previous steps the order of every suffix within each set Sk, k £ [0 : v), has 
been found. Now, let 5" be the set of suffixes starting with a, for a £ {SU {—1})". Then, every 
set S" is composed of ordered subsets S^, where S^ = S" f]Sk- 

All the suffixes Si, i £ [0 : n), are partitioned into the sets S"' by representing each suffix by 
the substring x[i : i + v), and sorting these substrings using radix sorting in time 0(vn). 
Step 4 - Merge and complete the suffix ordering 

For all a £ S"", the total order within set S"" can be obtained by merging the subsets S^, 
k £ Tjy. This comparison-based t;-way merging step uses the fact that all the suffixes in x" start 
with the same substring a, in conjunction with Lemma 1. Due to this lemma, a value I £ [0 : v) 
exists such that for any i,j the comparison of suffixes Si, Sj only requires the comparison of 
rank[i + I] and rank[j + I]. Having already partitioned the suffixes into sets 5" and found the 
order of the suffixes within each set Sk, k £ [0, v), the suffix array can be fully constructed through 
this merging process in time 0{vn). n 

All the steps of the algorithm can be completed in time 0{vn), and the recursive call is made 
on a string of size at most which corresponds to \D\ — 4, v — 5. This leads to an overall 
running time of 0{vn). 



4 BSP model 



The bulk- synchronous parallel (BSP) computation model |18|llj was introduced by Valiant in 1990, 
and has been widely studied ever since. The model was introduced with the aim of bridging the 



gap between the hardware development of parahel systems and the design of algorithms on such 
systems, by separating the system processors from the communication network. Crucially, it treats 
the underlying communication medium as a fully abstract communication network providing point- 
to-point communication in a strictly synchronous fashion. This allows the model to be architecture 
independent, promoting the design of scalable and portable parallel algorithms, while also allowing 
for simplified algorithm cost analysis based on a limited number of parameters. 

A BSP machine consists of p processors, each with its local primary and secondary memory, 
connected together through a communication network that allows for point-to-point communica- 
tion and is equipped with an efficient barrier synchronisation mechanism. It is assumed that the 
processors are homogeneous and can perform an elementary operation per unit time. The commu- 
nication network is able to send and receive a word of data to and from every processor in g time 
units, i.e. g is the inverse bandwidth of the network. Finally, the machine allows the processors to 
be synchronised every I time units. The machine is, therefore, fully specified using only parameters 
p, g, I, and is denoted by BSP{p, g,l). 

An algorithm in the BSP model consists of a series of supersteps, or synchronisation steps. In 
a single superstep, each processor performs a number of, possibly overlapping, computation and 
communication steps in an asynchronous fashion. However, a processor is only allowed to perform 
operations on data that was available to it at the start of the superstep. Therefore, in a single 
superstep, a processor can send and receive any amount of data, however, any received data can 
only be operated on in the following superstep. At the end of a superstep, barrier synchronisation 
is used to ensure that each processor is finished with all of its computation and data transfer. 

The cost of a BSP superstep on a BSP{p, g, I) machine can be computed as follows. Let worki 
be the number of elementary operations performed by processor i-^, i G [0 : p), in this superstep. 
Then, the local computation cost w of this superstep is given hy w = maxi^^Q-p) (worki). Let 
and ft.™ be the maximum number of data units sent and received, respectively, by processor Pi, 
i € [0 : p), in this superstep. Then, the communication cost h of this superstep is defined as h — 
maXjg[Q.p-)(/i°"*)-|-maXjg[o.p)(/i™). Therefore, the total cost of the superstep is w + h- g+l. The total 
cost of a BSP algorithm with S supersteps, with local computation costs Ws and communication 
costs hs , s & [0 : S) , \s W + H ■ g + S ■ I , where W = X]s=o total local computation cost 

and H — X]f=o^ is the total communication cost. 

The main principle of efficient BSP algorithm design is the minimisation of the algorithm's 
parameters W, H, and S. These values typically depend on the number of processors p and the 
problem size. 

5 BSP Algorithm 

Along with the sequential suffix array construction algorithm, described in Section [3] Karkkainen 
et al. [5] discuss the design of the algorithm on various computation models, including the BSP 
model. They give a brief overview of a parallel suffix array construction algorithm, running on a 
BSP{p, g, I) machine, with optimal 0{^) local computation and communication costs and requiring 

O(log^p) synchronisation steps. The algorithm uses a number of existing parallel sorting and 
merging algorithms to achieve this result. In the first part of this section we present a deterministic 
BSP algorithm that preserves these optimal local computation and communication costs while 
reducing the number of required synchronisation steps to a near optimal O (log log p). Following 
this, a detailed algorithm analysis is presented. 

The algorithm described in Section [3] initially solves the suffix array construction problem 
on a sample of the suffixes of the input string, in order to gain important information that is 
then used to efficiently sort all the suffixes. Sampling techniques are widely used in various fields 
ranging from statistics to engineering to computer science. In fact, a number of parallel algorithms 
exist that use sampling to efficiently solve problems, including sorting 14 IJ and convex hull 116) 
algorithms. In jlTj . Tiskin presents a BSP algorithm for the selection problem, in which, not only is 
the data sampled, but, the sampling rate is increased at a carefully chosen rate in successive levels 
of recursion. This reduces the number of synchronisation steps required by the parallel selection 



algorithm from the previous upper bound of 0{\ogp) to a near opthnal O (log log p), while keeping 
the local computation and communication costs optimal. We make use of this technique, which 
we call accelerated sampling, to achieve the same synchronisation costs for our parallel suffix 
array construction algorithm, while, again, keeping the local computation and communication 
costs optimal. In contrast with [T7] , in our algorithm the sampling frequency has to be decreased, 
rather than increased, in successive levels of recursion. 

The algorithms presented in this section are designed to run on a BSP{p, g, I) machine. We 
denote the sub-array of an array a assigned to processor ir G [0 : p) by and extend this notation 
to sets, i.e. we denote by A.^ the subset of a set A assigned to processor tt. 

Before detailing our algorithm, we give an overview of the parallel integer sorting algorithm 
introduced in [1 . The algorithm is based on the parallel sorting by regular sampling algorithm |14j . 
but uses radix sorting to locally sort the input, removing the extra cost associated with comparison 
sorting. Given an array y having m distinct integers, such that each integer is represented by 
at most K digits, the algorithm returns all the elements of y sorted in ascending order. If two 
integers are identical, then their index in the array y is used to determine their relative order, 
i.e. for two identical integers y[i] = y[j], we assume that y[i\ precedes y[j] if « < J and y[i] 
succeeds y[j] otherwise. Since the presented suffix array construction algorithm runs on strings 
over = N U {— 1}, then we can use the same algorithm, which we refer to as the parallel 
string sorting algorithm, to sort an array of m strings or tuples, each of fixed length k. In this 
case, the algorithm has O(k^) local computation and communication costs and requires 0(1) 
synchronisation costs. The algorithm, given an input array y oi m strings over S, with each string 
of length at most k, works as follows. 



Algorithm 2. Parallel String Sorting 
Parameters: integer m > p^; integer k, 

Input: array of strings y — 2/[0] . . . y[m — 1], with each string over S — Ij and of size k 

Output: array y ordered in ascending lexicographic order 

Description: 

The input array y is assumed to be equally distributed among the p processors, with every 
processor tt G [0 : p — 2], assigned the elements y ^7r:^(7r-|-l)j, and processor p — 1 assigned 



elements y ^ (p — 1) : mj . Note that each processor holds ^ elements, except the last processor 

p — 1, which may hold fewer elements. We call this type of distribution of elements among the p 
processors a block distribution. Each processor tt first locally sorts sub- array y-^, using radix sorting, 
and then chooses p + 1 equally spaced samples from the sorted sub-array, including the minimum 
and maximum values of y^. These samples, which we call primary samples, are sent to processor 
0. Having received (p + l)p primary samples, each of which is a string of length k, processor 
locally sorts these samples, using radix sorting, and chooses p -I- 1 sub-samples, including the 
minimum and maximum values of the primary samples. These chosen sub-samples, which we call 
secondary samples, partition the elements of y into p blocks Yq, . . . , Yp-i. The secondary samples 
are broadcast to every processor, and each processor tt then uses the secondary samples to partition 
its sub-array y^^ into the p sub-blocks Yq ^,-, . . . ,1^-1.,^. Each processor tt collects the sub-blocks 
Fjr.x from processors x G [0 ■ p), i.e. all the elements of F^, and locally sorts these elements using 
radix sorting. The array y is now sorted in ascending lexicographic order, however, it might not 
be equally distributed among the processors, so an extra step is performed to ensure that each 
processor has ^ elements of the sorted array. Note that each primary and secondary sample also 
has the index of the sample in y attached to it, so that any ties can be broken. □ 

The parallel suffix array construction algorithm presented below requires that the input string 
X of size n be distributed equally among the p processors, using a block distribution, prior to 
the algorithm being called. Therefore, each processor tt e [0 : p — 2] initially holds the elements 

X ^TT : {tt + 1)^ , while processor p — 1 holds elements x ^ (p — 1) : We denote by the 
subset of the index set [0 : n) that indexes x^, tt S [0 : p), i.e. /^r = ^tt : ^ {tt + l)j , for tt € 



[0 : p — 2], and = ^ (p — 1) : We require that every processor tt S [0 : p — 2], also holds 
a copy of the first w — 1 characters of the substring x^^+i, where w is a parameter of the algorithm, 
to be able to locally construct its subset of super-characters. Finally, we we use the same indexing 
for a and Ott, i-e. a[i] — a^[i]. 

The algorithm is initially called on string x of length n, with parameters n and w = 3. 

Algorithm 3. Parallel Suffix Array Construction 

Parameters: integer n> p^; integer v G [3 : n] 

Input: string x = x[0] . . . x[n — 1] over alphabet U = [0 : n) 

Output: suffix array SAj; = SA^lO] . . . SAx[n - 1] 

Description: 

Recursion base 

Recall that if all the characters of x are distinct, then SA^ can be obtained by sorting the 
characters of x in ascending order. Therefore, we call Algorithm 2 on string x with parameters 
m = n and k = 1. When the algorithm returns with the sorted list of characters, which we call a;', 
each processor tt, holds the sub-list x'^ of size ^, and checks for character uniqueness in its sub-list. 
If all the characters of each sub- list are distinct, then, each processor tt g [0 : p — 2], checks with 
its neighbour tt 4- 1 to ensure that x'[^(Tr + 1) — 1] x'[^-^{tt + 1)]. If every character is distinct 
then each character in the sorted list x' is replaced by its index in x and x' is returned. However, 
if at any point in this process two identical characters are found, then the following steps are 
performed: 

Step - Sample construction and initialisation 

Every processor tt, constructs the difference cover D of Z„ as discussed in Section [2] Then, 
each processor tt, for each k E [0 : v), defines the subset = {i E I-^ \ i mod v = k}. This 
partitions each set of indices Bk into p subsets of size about ^ . The subset of the difference 
cover sample C is then constructed by every processor tt, such that Ctt = UkeoBk-n-- We denote 
by SkTT, k £ [0 : v) and tt S [0 : p), the set of suffixes Si, i G i?A;;r- 

Finally, every processor tt also declares the array rankT^ , of size ^ + f for tt G [0 : p — 2] , and 
size n— ^(p— l)-|-ufor7r=p— 1. Each element of rank^^ is initialised by -1. Note that the size of 
each rank.^ , tt G [0 : p — 2], follows from the fact that each such processor requires a copy of the 
first V elements of rank-^^i in order to be able to locally construct the tuples associated with all 
the non-sample characters in x^^. 
Step 1 - Sort sample suffixes 

For every processor tt, we define, for each k £ D, the substring of super-characters Afc^r = 
OieBk 2; [i : i -I- t;), such that |Afe^| = Note that every substring x[i : i + v) is locally available 
for all i G C^, due to the padding convention and the distribution of x among the processors. Then, 
construct the string of super-characters X, as discussed in Section [3j This string is distributed 
among the p processors, with each processor having \D\^ super-characters. Note that it is not 
necessary to actually construct X, since the position of each A^^, and, therefore, the index of each 
super-character x[i : i + v), i £ C, in X can be calculated by every processor tt. However, this is 
done for simplicity. Algorithm 2 is then called on string X with parameters m — and k = v. 
After sorting, a rank is assigned to each super-character in its sorted order, with any identical 
super-characters given the same rank, and the string X' is constructed, as discussed in Section [3| 
Note that X' is already equally distributed among the processors. 

The algorithm is then called recursively on the string A' with parameters n — |A'| and 
v' = min(w^/'', |A'|), where v' is the value of v in the subsequent recursion level. If |A'| < ^, then 
A' is sent to processor 0, which calls the sequential suffix array algorithm on A' with parameters 
n = |A'| and t; = 3. A detailed discussion on the bound of v' — min(i;''/'', |A'|) and its impact on 
the synchronisation costs of the algorithm is given later in this section. 

When the recursive call returns with SAx', the rank of each in SAx', i G Cfc^ri S [0 : p), is 
recorded in rank^^. Also, a copy of the first v elements of rank^^, for tt G [1 : p), is kept in rank^^-i. 
The order of each suffix Si within each set Sk, k £ D, is stored by each processor tt, for i G 1^^. 
Step 2 - Find the order of the non- sample suffixes within each set Sk, k € D 



For each k e Zi,\D, consider any G [1 : v) such that (k + lk) mod v G D. We define the tuple 
ti = (x[i],x[i + 1], . . . ,x[i + lk — l],rank[i + Ik]), for each character x[i], z £ /jr \ Cjr, vr e [0 : p) and 
k = i mod V. Note that every character in the tuple can be constructed locally on processor tt. 

Then, every processor tt € [0 : p) constructs the subsequence of tuples {tiji^Sk^i for each subset 
Sfcjr, k "Ey \ D. Therefore, each sequence (ti)i^Bk is the concatenation of the subsequences 
{'ti)ieBk„ ill ascending order of tt. Recall from Sectionjs] that the number of sequences {ti)i^Bk to 
be sorted \s v — and that each sequence contains ^ tuples, of length at most v. Therefore, 
each processor holds about ^ tuples of each sequence. 

Each sequence is then sorted using Algorithm 2 with parameters m = ^ and k, being the length 
of the tuples in the sequence, which is at most v. After each sequence is sorted, the order of each 
non-sample suffix Si within each set Sk, fc € Z^, \ D, is stored by each processor n, i € 
Step 3 - Sort all suffixes by first v characters 

Let each suffix Si,i G [0 : n), of x be represented by the substrings x[i : i + v). These substrings 
are sorted using Algorithm 2 with parameters m = n and k — v. The index of each substring in 
X is used to determine the order of identical substrings. After sorting, the suffixes of x will have 
been partitioned into the sets S", a £ {S L> {—1})", as discussed in Sectionjs] 
Step 4 - Merge and complete the suffix ordering 

Recall from Section [s] that, each set 5", a £ {SU {—1})", is partitioned into at most v subsets 
S"^, k S [0 : w), and that the order of the suffixes within each such subset has been found in 
the previous steps. Ordering a set S" is achieved through a w-way merging procedure based on 
Lemma [l] For every two subsets S^, and S*^,,, k' , k" e [0 : u), we choose any Z e [0 : w) such that 
(k' + I) mod V and (fc" + I) mod v are both in D. Then, comparing two suffixes Si £ Sk' and 
Sj £ Sk" only requires the comparison of rank[i + I] and rank[j + I]. 

Therefore, in order to sort every element of S°' we require, for each element of 5", the order 
of the element within the subset S^ it belongs to and at most \D\ values from the array rank. 
Hence, at most {\D\ + 1)^ values need to be received by each processor. Note that the order of 
each suffix Si, i £ [0 : n), within the set Sk, i mod v = k, is stored on processor tt, i g as is 
rank[i + I], for any I £ [0 : v). 

After the sorting procedure in the previous step, the suffixes of a set S*", a £ , are contiguous 
and can be either contained within a single processor, or span two or more processors. If S" is 
contained within one processor, then this processor locally merges each of the subsets of S". If 
the set spans two processors n' ,tt" £ [0 : p), then, for each of the suffixes Si £ S" , i £ [0 : n), on 
processor tt" , the values required to merge Si into the ordered S" are sent to tt'. Processor tt' then 
locally merges each of the subsets of S". Otherwise, if the set S" spans more than two processors, 
the following procedure is performed. 

Let p' be the number of processors that the set S" spans. Then, S" is equally divided among 
the p' processors, such that each is assigned elements. Again, note that the actual suffixes 
Si £ S" , i £ [0 : n), are not communicated, but only the values required by the merging process 
are, i.e. at most |D| + 1 values for each suffix in 5'". 

Each of the p' processors locally sorts its assigned elements of S*", using the v-wa.y merging 
procedure, and chooses p' + 1 equally spaced primary samples from the sorted elements, including 
the minimum and maximum elements. Every primary sample is sent to one of the p' processors 
that is chosen as the designated processor. Therefore, this designated processor receives {p' + l)p' 
primary samples, which it sorts locally using the u-way merging procedure. It then chooses p' + 1 
equally spaced secondary samples from the merged primary samples, including the minimum 
and maximum primary samples, that partition S" into p' blocks. These secondary samples are 
broadcast to the p' processors such that each processor can partition its assigned elements into 
p' sub-blocks. Every processor then collects all the sub-blocks that make up a unique block and 
locally merges the received elements. 

Note that a processor can only have elements from at most two sets that span across three or 
more processors. Therefore, this procedure can be done in parallel for each set S". After all the 
sets S" have been sorted, all the suffixes of x have been ordered and the sufhx array is returned. 
□ 
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5.1 Algorithmic Analysis 

The presented suffix array construction algorithms are recursive, and the number of levels of 
recursion required for the algorithms to terminate depends on the factor by which the size of the 
input string is reduced in successive recursive calls. While the number of levels of recursion does not 
influence the running time of the sequential algorithm, in BSP this determines the synchronisation 
costs of the algorithm, and, therefore, we want to reduce it to a minimum. Before detailing the 
costs of each step of the algorithm we explain how changing the sample size at each subsequent 
level of recursion results in O (log log p) levels of recursion. 

We refer to each level of recursion of the algorithm as round i, i > 0. Then, we denote by Ui, Vi 
and Di the size of the input string, the parameter v and the difference cover D of Zu^, respectively, 
in round i. 

Recall from Section [2] that, the maximum size of a difference cover D oi Zy, for any positive 
integer v, that can be found in time 0{^/v) is ^/l.5v + 6, i.e. \D\ = 0{v^^^). Therefore, for the sake 
of simplicity, in our cost analysis we assume that = ti/^^. 

The analysis given in Table [3] shows how changing the sampling rate affects the parameters v 
and n in subsequent recursive calls. Recall from Section [3] that, the cost of each level of recursion 
in the sequential algorithm is 0{vini). Therefore, the table also shows that the order of work done 
decreases in subsequent recursive calls. 

The results in Table [3] clearly show that, if the algorithm is initially called on a string of size 
n, with parameter w = 3, on a BSP{p, g, I) machine, then the size of the input converges towards 
^ super-exponentially. In fact, after log5/4(log3P^/^ + 1) = O(loglogp) levels of recursion, the size 
of the input string is O(^), and in the subsequent level of recursion the suffix array is computed 
sequentially on processor 0. Note that the value | as a power of v is not the only one possible. In 
fact, any value 1 < a < | can be used, but a = | is used for simplicity. Finally, note that Vi > rii 
only after O (log log p) levels of recursion, at which point the algorithm is called sequentially on a 
single processor. 

Having determined the number of recursive calls required by the algorithm, the cost of each 
step of the algorithm is analysed. For each step, the costs of the first round of the algorithm are 
specified below, along with the costs of round log^^^llog^p^^'^ + l)j which we call the critical rounds 
since this is the round immediately before the algorithm is called sequentially on a string of length 
less than -. 

V 

In the recursion base, the costs are dominated by those of Algorithm 2, i.e. O(^) local com- 
putation and communication cost. Therefore, in the first round the local computation and com- 
munication costs are O(^), and in the critical round these costs are O(^). A constant number of 
synchronisation steps is required. 

In step 0, constructing the difference cover Di has running time O(y^), i.e. 0(1) in the first 
round and 0(p^/*) in the critical round. Constructing the subsets C^r, independently for each 
processor tt e [0:p), has 0{\Di\^) local computation cost, i.e. O(-) in the first round and 

-lyj) in the critical round. Finally, declaring and initialising rank^ requires 0{^+Vi) work, i.e. 



O(-) in the first round and O(^) in the critical round. A single synchronisation step is required, 
wit?i no communication. 

In step 1, the costs are dominated by the construction of the string of super-characters X 
and the call to Algorithm 2, leading to 0{\Di\^) local computation and communication costs. 
Therefore, the costs of this step in the first round are O(^) local computation and communication, 
while in the critical round these costs are 0{^^). A constant number of synchronisation steps is 
required in each round. 

In step 2, the costs are again dominated by the call to Algorithm 2 for each sequence of tuples. 
The size of each sequence is and the size of each tuple is at most vi. Therefore, the local 
computation and communication costs to sort each sequence are O(^), i.e. O(^) in the first 
round and O(^) in the critical round. Each sequence is sorted independently using Algorithm 2, 
and, since the number of sequences to be sorted, Vi — \Di\, is always less than p, then each sequence 
can be sorted in parallel in each round by having a different designated processor for each call to 
Algorithm 2. Therefore, the number of synchronisation steps required is always constant. Recall 
that Algorithm 2 requires slackness, m > p^. Since, in the critical round. Algorithm 2 is called 
on a sequence of length 9^, we require that n > pl'^. Note that this slackness can be reduced 
by sorting sequences locally if each sequence fits on a separate processor, however, such detail is 
beyond the scope of this paper and will be given in a journal version of this paper. 

The cost of step 3 is simply the cost of Algorithm 2 on a string of size rii with k = Vi, i.e. 
0{vi^) local computation and communication costs and 0(1) synchronisation steps. Therefore, 
in the first round the local computation and communication costs are O(^) and these costs in the 
critical round are 0(^^)- 

In step 4, obtaining, for each suffix of x, the information required for the sorting each set 5" 
using a w-way merging procedure has 0{\Di\^) local computation and communication costs, i.e. 
O(^) in the first round and 0{^^) in the critical round. Then, sorting a set S" that is contained 
on a single processor has 0(|S'"|wi) local computation costs, and no communication is required. 
Note that in this case jS""! < so the local computation costs are O(^) in the first round and 
0{:^) in the critical round. If S*" spans two processors, then we send all the elements of the 
set to one of the two processors. Therefore, since each processor has y suffixes, |S'"| < 2y, so 
the costs of sorting this set are O(wj^) local computation, 0{\Di\^) communication and 0(1) 
synchronisation steps; i.e. O(^) local computation and communication costs in the first round and 
0{:^) in the critical round. 

Finally, if a set S°' spans p' > 2 processors, then IS"] > (p' — In this case a procedure 

similar to the parallel radix sorting algorithm on p' processors is performed. In fact, the only 
difference between the two is that u-way merging is used, instead of radix sorting, to locally sort 
the suffixes on each of the p' processors. Since the v-way merging procedure on n elements has 
the same asymptotic costs as the radix sorting procedure on an array of n strings each of size 
V, over an alphabet 2J = Z, then the local computation cost for this procedure is 0{vi^) and 
the communication cost is 0{\Di\^). Therefore, both these costs are O(^) in the first round and 
0{^^) in the critical round. Since each such set can be merged independently in parallel, then a 
constant number synchronisation steps is required. 

Note that, in the i*'' level of recursion, each step has local computation cost 0{vi^), com- 
munication cost 0{vi^) and 0(1) synchronisation costs. Also note that, as shown in table jsj 
0{vini) decreases super-exponentially in each successive level of recursion, and, therefore, the or- 
der of work done in each round of the BSP algorithm also decreases super-exponentially. Since 
the presented parallel suffix array construction algorithm is initially called on a string of size n 
with parameter u = 3, the algorithm has O(^) local computation and communication costs and 
requires O(loglogp) synchronisation steps. 



6 Conclusion 



In this paper we have presented a deterministic BSP algorithm for the construction of the sufRx 
array of a given string. The algorithm runs in optimal O(^) local computation and communication, 
and requires a near optimal O (log log p) synchronisation steps. 

The method of regular sampling in coarse-grained algorithms has been used to solve the sort- 
ing |14llj . and 2D and 3D convex hulls [16 problems. Random sampling has been used to solve 
the maximal matching problem and provide an approximation to the minimum cut problem [5] 
in a parallel context. An extension of the regular sampling technique, which we call accelerated 
sampling, was introduced by Tiskin JiT] to improve the synchronisation upper bound of the BSP 
algorithm for the selection problem. The same technique was used here to improve the synchroni- 
sation upper bounds of the sufhx array construction problem. Accelerated sampling is a theoret- 
ically interesting technique, allowing, in specific cases, for an exponential factor improvement in 
the number of synchronisation steps over existing algorithms. 

It is still an open question as to whether the synchronisation cost of the suffix array construction 
problem and the selection problem can be reduced to the optimal 0(1) while still having optimal 
local computation and communication costs. Another open question is whether further applications 
of the sampling technique, whether regular, random or accelerated, are possible. 
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